Return to Overview

In this analysis, we are evaluating the performance of a series of panels in a solar power plant. We want to see if any of the panels show issues that may indicate needed maintenance, or if any are performing sub-optimally.
The data comes from Kaggle.


As usual, we begin by inspecting the data and checking for missing values.
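A minimal sketch of that first pass, assuming the Kaggle CSV has already been read into `powerData` with the column names shown in the summary output below:

```r
# Sketch of the initial inspection; `powerData` is assumed to hold the
# Kaggle generation data with the columns shown in the summary below.
summary(powerData)                    # distribution of each column
colSums(is.na(powerData))             # count of missing values per column
length(unique(powerData$source_key))  # number of distinct panels
```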

##   date_time            plant_id        source_key           dc_power    
##  Length:68778       Min.   :4135001   Length:68778       Min.   :    0  
##  Class :character   1st Qu.:4135001   Class :character   1st Qu.:    0  
##  Mode  :character   Median :4135001   Mode  :character   Median :  429  
##                     Mean   :4135001                      Mean   : 3147  
##                     3rd Qu.:4135001                      3rd Qu.: 6367  
##                     Max.   :4135001                      Max.   :14471  
##     ac_power        daily_yield    total_yield     
##  Min.   :   0.00   Min.   :   0   Min.   :6183645  
##  1st Qu.:   0.00   1st Qu.:   0   1st Qu.:6512003  
##  Median :  41.49   Median :2659   Median :7146685  
##  Mean   : 307.80   Mean   :3296   Mean   :6978712  
##  3rd Qu.: 623.62   3rd Qu.:6274   3rd Qu.:7268706  
##  Max.   :1410.95   Max.   :9163   Max.   :7846821
## [1] 22

Depending on the time of day, power generation varies widely, so looking at averages across the days is not going to be informative. Means are also too prone to outliers, so we’ll focus on median power generation to get an overview of the individual panels.

We’re also dealing with 22 individual panels, which can make visualizations a little tricky; they’re liable to get a bit crowded. Case in point:


It isn’t horrible; I may use it to show a non-technical group the two panels that are clearly under-performing. But we can do better if we want to get more information:


By plotting the mean against the median, we can observe just how much outliers in the data affect each panel, and how the large variance in the data skews the mean away from the median. Down in the left corner are our two potentially problematic panels, which we want to keep in mind and dig into a little later in this analysis. For now, we want to find any potential issues that have arisen in the past ~month of data that we have.
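The per-panel statistics behind that scatter can be sketched like so; this is a sketch assuming the `powerData` frame and the dplyr/ggplot2 packages used elsewhere in this analysis:

```r
library(dplyr)
library(ggplot2)

# Per-panel mean and median AC power across the whole period.
panelStats <- powerData %>%
  group_by(source_key) %>%
  summarise(meanAC   = mean(ac_power),
            medianAC = median(ac_power),
            .groups = "drop")

# Mean vs. median: the further a point sits from the diagonal, the more
# its mean is skewed by outliers; the bottom-left corner holds the
# potentially problematic panels.
ggplot(panelStats, aes(x = medianAC, y = meanAC)) +
  geom_point() +
  labs(x = "Median AC power", y = "Mean AC power")
```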

#-----------Analyze power generation to identify issues------
aggDF <- powerData %>%
  mutate(hour = substr(date_time, 12, 13),
         hourMinute = substr(date_time, 12, 16),
         day = as.Date(substr(date_time, 1, 10), format = '%d-%m-%Y')) %>%
  group_by(source_key, hourMinute) %>%
  mutate(hourMinuteMean = mean(ac_power),
         hourMinuteStd = sd(ac_power)) %>%
  ungroup()

issues <- aggDF %>%
  filter(ac_power < (hourMinuteMean - 3*hourMinuteStd)) %>%
  mutate(changeFromAverage = (ac_power - hourMinuteMean) / hourMinuteMean * 100)
length(unique(issues$source_key))
## [1] 19

I’m including the code here to make the logic I went through easier to follow. I took the date_time column and did a little feature engineering, extracting the hour of the day and the day itself into their own columns. My thought process is this:

The time of day affects power generation greatly, but the day itself should not. So I want to compare similar times of day together and aggregate at that level, not the day level. In theory, I want to set this up so it could become some kind of live-alerting system.

I then compute the average for each time of day; the mean is okay in this case since, as stated above, day-to-day production at a specific time shouldn’t differ widely. I also get the standard deviation for each time of day to help identify outliers.
At that point, I can flag every reading that came in more than 3 standard deviations below the average power production for that specific date_time. Looking at how many panels experienced some sort of issue, the answer is 19, which is pretty much all of them. That seems a bit fishy to me, so I break down the unique hours in which issues arose.
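That breakdown is just a matter of pulling the distinct hours out of the flagged rows (assuming the `issues` frame built above):

```r
# Unique hours of day in which any reading fell 3+ standard deviations
# below that panel's average for the same time of day.
unique(issues$hour)
```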

## [1] "09" "10" "12" "13" "11" "06"

And there it is: a small timeframe in which the oddities occur. My guess is that these hours are when the sun is rising and setting, so only a small subsection of the panels are getting sunlight, and at varying intensities. The odd one out here is the 6th hour, so I dug into that a little more:
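Digging into the 6th hour is one more filter on the same frame; a sketch, again assuming the `issues` frame from above:

```r
library(dplyr)

# All flagged readings from hour 06, keeping the columns shown in the
# output below.
issues %>%
  filter(hour == "06") %>%
  select(date_time, source_key, ac_power, hourMinuteMean, changeFromAverage)
```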

## # A tibble: 2 Ă— 5
##   date_time        source_key      ac_power hourMinuteMean changeFromAverage
##   <chr>            <chr>              <dbl>          <dbl>             <dbl>
## 1 17-06-2020 06:45 bvBOhCH3iADSZry        0           101.              -100
## 2 17-06-2020 06:45 iCRJl6heRkivqQ3        0           107.              -100

Oh look! One of those is our little sub-optimal friend from above. Seems there are a few interesting indicators for that panel. Another panel had no power production at the same time as well.

This seems like a good time to look into the sub-optimally performing panels.
I do this by defining a panel as an under-performer if its median power production is lower than the overall median power production. I then plot the number of these under-performing panels over time.
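One way to sketch that count, under my reading of the definition (a panel under-performs on a given day if its daily median falls below the median of all panels' daily medians); `aggDF` and its `day` column come from the earlier code block:

```r
library(dplyr)
library(ggplot2)

# Daily median AC power per panel.
dailyMedians <- aggDF %>%
  group_by(day, source_key) %>%
  summarise(dailyMedian = median(ac_power), .groups = "drop")

# Flag panels whose daily median is below the overall median of daily
# medians, then count the flagged panels per day.
underCounts <- dailyMedians %>%
  mutate(under = dailyMedian < median(dailyMedian)) %>%
  count(day, wt = under, name = "nUnder")

ggplot(underCounts, aes(x = day, y = nUnder)) +
  geom_line() +
  labs(x = "Date", y = "Under-performing panels")
```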


The result tells us that the number of under-performing panels has dropped by a pretty large amount in June compared to May. Wins like this are always good to find. Builds team morale (and makes the stakeholders happy. Mostly that.).

Alright, last thing I want to do is look at those two panels from before. I want to know if they’ve always been under-performing, or if it started at a specific date.
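A sketch of that comparison; the two IDs here are hypothetical placeholders for the panels picked out of the mean-vs-median scatter earlier:

```r
library(dplyr)
library(ggplot2)

suspects <- c("panel_id_1", "panel_id_2")  # placeholder IDs, not real keys

# Daily median output of each suspect panel vs. all other panels pooled.
comparison <- aggDF %>%
  mutate(group = ifelse(source_key %in% suspects, source_key, "all others")) %>%
  group_by(group, day) %>%
  summarise(dailyMedian = median(ac_power), .groups = "drop")

ggplot(comparison, aes(x = day, y = dailyMedian, colour = group)) +
  geom_line() +
  labs(x = "Date", y = "Daily median AC power")
```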


Looks like they have pretty consistently been under-performing compared to the average of the others. That probably means it isn’t an issue of cleaning them off; rather, they could just be faulty.

In a future analysis, I will build out a model that uses the associated weather data to predict the power generation.